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AAAI-93) , agents that assist with World Wide Web browsing (H. 
Lieberman, 1995, Letizia: An agent that assists web browsing, 
Proc. 15th Int. Joint Conf. on A.I. pp. 924-929; Robert 
Armstrong et al. , 1992, Webwatcher: A learning apprentice for 
5 the world wide web, Working Motes of the AAA! Spring 
Symposium: Information Gathering from Heterogeneous, 
Distributed Environments, pp. 6-12, Stanford University, AAAI 
Press), agents that schedule meetings (Lisa Dent et al., 
1992, A personal learning apprentice, Proc. 10th Nat. Conf. 
10 on A.I., pp. 96-103; Pattie Maes, 1994, Agents that reduce 
work and information overload, Comm. of the ACM 32(7): 31-40, 
146; Tom Mitchell et al., 1994, Experience with a learning 
personal assistant, Comm. of the ACM 37 (7) :81-91) , and agents 
that perform internet-related tasks (O. Etzioni et al., 1994, 
15 A scftbot-basod interface to the internet, CACM 37 (7) :72-75) . 
Increasingly, the information such agents need to access is 
available on the World Wide Web. Unfortunately, even a 
domain as standardized as the WWW has turned out to pose 
significant problems for automatic software agents. For one, 
20 although Web pages are universally written in Hypertext 
Markup Language ("HTML"), this language merely defines the 
format of information display, making no attempt to hint at 
its meaning or semantic content. Currently, no accepted 
"semantic sarkup language" for the Web exits, nor is one 
2S likely to adopted universally. The Internet can be expected 
to pose even greater problems. 

Thus, the advent of intranets, the Internet, and the 
World Wide Web have posed several fundamental problems for 
the automatic services or agents designed to assist users to 
30 find relevant information. First, no one such service has 
heretofore provided sufficient additional value to replace 
the use of a Web browser having access to existing 
directories and indices such as Yahoo or Lycos. Second, such 
services have not yet been able to understand and competently 
35 parse relevant information from the responses returned from a 
wide variety of Internet and Web information sources. Third, 
existing services and agents have not been easy to adapt to 
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METHOD AND 8YBTEM 
FOR NETWORK IKFOP M^TTQN ACCESS 

1. FIELD OF TMvEKTION 

The field of this invention relates to information 
5 access over networks, and specifically to the automatic 
location and evaluation of relevant information available 
over public or private networks from information sources in 
response to user queries. 

10 2. BACKGROUND 

The exponential growth of private intranets and the 
public internet has produced a daunting labyrinth of 
increasingly numerous documents, databases and utilities, 
-most sr.v type of information is now available somewhere, 
13 "but mort users cannot find what they seek, and even expert 
users waste copious time and effort searching for appropriate 
information sources. One problem is simply the increasingly 
large number of available information sources that are beyond 
the comprehension of a single user. A second problem, along 
20 with this growth in available information and information 
sources, is a commensurate growth in software utilities and 
methods to manage, access, and present this information. 
Each utility has a different and often unique interface and 
set of commands and capabilities, and is appropriate for a 
" different set of users and a different set of information 
types and sources. Thus sheer diversity of available 
utilities creates problem for users comparable to that 
created by information explosion. Users are now faced with 
the twin problems of which tool to use to inquire at which 

30 

information source. 

in the past efforts have been made to provide users with 
automatic, computer assisted services that can help solve 
these twin problems of the network revolution. For example 
AI researchers have created several prototype software agents 
that help users with e-mail and netnews filtering (Pattie 
Maes et al., 1993, Learning interface agents. Proceedings of 



35 
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only the most relevant information to the user. On the one 
hand, each user query is forwarded only to the primary 
information sources determined to be most relevant. On the 
other hand, information source responses are parsed and 
5 understood so that only the relevant data items are extracted 
for user presentation. Duplicate, stale, and irrelevant 
information items are discarded. Second, a netbot is fast. 
Since it automatically searches the relevant primary sources 
in parallel, it can present information as quickly as the 
10 fastest primary source returns a response. Despite changing 
conditions which cause different information sources to 
fluctuate in speed, a netbot integrator remains as fast as 
the fastest source. Sources that have no information to 
return to a query do not slow the user since the netbot 
15 -i=ply ignore- the®. Third, r.etbcts m easily adapted to 
" the" ever-increasing number of network information sources 
with ever-changing response formats. Netbots utilize a new 
and novel declarative language for describing informatxon 
sources. A source description is short and easily 
20 understandable, and therefore is easily written and 
maintained. 

Therefore, in one aspect the invention includes a method 
for efficient access to information sources on a network 
comprising preferably ona or more of the following steps: 
25 receiving a user query for information; determining the 

information sources most relevant to this query; retrieving a 
description of each information source; formatting the query 
according to this description in a manner suitable for each 
information source and transmitting the formatted query to 
30 the source; receiving responses from the information sources; 
for each source, understanding and extracting the relevant 
data fields according to the retrieved description; and 
presenting to the user the relevant data from each , 
information source in an intelligent manner ranked by an 
35 estimate of its relevance. Advantageously, these steps are 
performed in parallel to the greatest extent possible. In 
particular, at least, all queries are transmitted to all 
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the ever- increasing numbers of sources with their ever- 
changing response formats. This is due to the 
individualized, hand-coded interface to each Internet service 
and Web site utilized by existing agents (Yigal Arens et al., 
5 1993, Retrieving and integrating data from multiple 

information sources, International Journal on Intelligent and 
Cooperative Information Systems 2 (2) : 127-158 ; o. Etzioni et 
al., 1994, A softbot-based interface to the internet, cacm 
37(7) :72-75; B. Krulwich, 1995, Bargain finder agent 
10 prototype, Technical report, Anderson Consulting; Alon Y. 

Levy et al. , 1995, Data model and query evaluation in global 
information systems, Journal of Intelligent Information 
Systems, Special Issue on Networked Information Discovery and 
Retrieval 5(2); Mike Perkowitz et al., 1995, Category 
15 Uanalc.Lion: Learning to understand information or. the 

internet, Proc. 15th Int. Joint conf. on A.I.) . Preferably, 
a service or agent should be able to access a new or changed 
internet information source in order to automatically learn 
how tc retrieve relevant information from the source. This 
20 would be advantageous even if such a facility were limited to 
groups of sources with response formats selected according to 
certain constraining principles. 

3. HtTKMARY OP ^HE IN VENTION 

25 It is a broad object of this invention to solve these 

fundamental problems by a method and system that provide a 
personalized network robot, called a "netbot." A netbot 
acts as a user's intelligent assistant by tracking available 
network information sources, knowing the relevant information 
30 and features of each particular source, and upon user request 
determining which sources are relevant to a given query, 
forwarding the query to the most relevant information 
sources, understanding the responses returned from each 
source, and integrating and intelligently presenting the 
35 query results to the user. 

The netbots of this invention possess several 
advantages, including the following. First, a netbot returns 
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Figs. 2A-B illustrate exemplary user interfaces of 
embodiments of the netbot of Fig. 1; 

Fig. 3 illustrates exemplary functional components of 

the netbot of Fig. l; and 
5 Fig. 4 illustrates alternative hardware embodiments of 

the netbot of Fig. 1. 

5. DETAILED nEECFIPTION 

For clarity of disclosure, and not by way of limitation, 
10 the detailed description of a netbot of this invention is 
presented as a method or process for network information 
access, as a system or apparatus implemented to perform that 
method, and as a novel language designed to assist in 
implementing this method and system, 
l- \n the £ol Irving, first an overview of these aspects of 

the invention is presented followed, second, by a detailed 
discussion of individual components. 

5.1. OVERVIEW Qy MET3QT EttCHTTECTURE 
2 0 A netbot method or system of this invention comprises 

software and hardware facilities that function together in 
one or more network attached computers to assist a user in 
access to information present in network attached servers 
(known herein as "information sources") . Fig. l generally 
25 illustrates the relationships of a netbot to a user and to 
networked information sources. For example, user l accesses 
user computer 3 through standard interface devices, such as 
monitor 2. In the course of work, the user needs information 
from information sources 7, attached to the user computer 
30 through various network links, such as network links 4 and 6. 
Since the information sources are many, the user can benefit 
from assistance in finding needed information from relevant 
information sources. This assistance is provided by netbot 
5, which maintains an awareness of available information 
35 sources and queries them through links 6 on behalf of, or as 
an agent of, the user. Alternatively, netbot 5 can partly or 
wholly reside on user computer 3 or be partially or wholly 
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relevant information sources in parallel without waiting for 
intervening responses. 

In another aspect the invention comprises a computer 
system and apparatus for performing one or more steps of the 
5 method of this invention. The user has a presentation device 
attached to a network to which is also attached a plurality 
of information sources. The presentation device receives 
user queries and displays netbot responses. Further, the 
presentation device performs one or more of the steps of the 
10 method of this invention. One or more of those steps not 
performed on this device can advantageously be performed on 
network attached netbot server computers, which respond to 
functional requests from the user device. Optionally, the 
user device can range from a diskless hand-held terminal, to 
3,*> a PC, to a workstation, and so fortn. 

In a further aspect the invention comprises a new 
language and language implementation to facilitate the 
creation and maintenance of descriptions of information 
sources. Importantly, this language recognizes relevant data 
20 fields in responses returned from information sources and is 
capable of extracting all such fields. In the preferred 
embodiment this language has an action statement component 
and a regular expression component. The regular expression 
component has novel features for creating modular 
25 hierarchical descriptions of regular expressions, for binding 
variables to the correct sub-strings recognized during 
pattern match to a response of an information source, for 
performing arbitrary action language statements with multiple 
variable bindings, and for specifying backtrack free 
30 recognition of sub-strings where possible. 

4. BRIEF DESCRIPTION OP THE DRAWINGS 
These and other features, aspects, and advantages of the 
present invention will become better understood by reference 
35 to the accompanying drawings, following description, and 
appended claims, where: 

Fig. 1 illustrates generally a netbot of this invention; 
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present in or accessed by a netbot. For example, a learning 
module can be present which provides for a netbot to acquire 
automatically the characteristics of a new network 
information source. Finally, as is known to those of skill 
5 in the art, the functions performed by the described modules 
may be divided or grouped in alternative fashions among a 
greater or lesser number of modules. 

In the absence of specific preference3, the processes of 
this invention can be implemented in complete procedural 
10 programming language, such as C, or a complete object 

oriented programming language, such as C++, on the disclosed 
hardware configurations. 

nnrv TTRFR INTE RFACE MODULE 
13 ir. core detail, the user interface module has both 

important functionality that is common to netbot user 
interfaces, whatever the information domain to which netbots 
are directed, and also has adaptations to the particular 
information domain of a particular netbot. Turning first to 
20 the preferable common functions, one such is the ability to 
remember a user's preferences for interacting with a netbot. 
such remembered preferences include, for example, whether a 
netbot is to fetch pages in order to calculate their 
relevance itsell, the number H or relevant information 
25 sources to query, display characteristics, etc. A second 
such common function is incremental display of information 
received from a query. Since each query usually causes a 
netbot to consult many independent information sources, 
results are often received at widely varying times. 
30 immediate display of asynchronously received results would 
cause such undesirable effects as screen flicker or 
disorienting rearrangement of already displayed data, 
incremental display accommodates, on the one hand, user 
desires to view information quickly with, on the other hand, 
35 user desires for a comprehensive view of all received 
information, sorted according to relevance. 
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distributed on the network and accessed by the user through 
link 4. 

Groups of sources 7 having similar sorts of information 
are grouped into conceptual classes called information 
5 domains . For example, one domain can be that of electronic 
stores for a particular product; another domain might include 
Internet indexes containing information on the keyword 
content of various World Wide Web ("WWW") pages . 

In a preferred embodiment, the netbot is composed of 
10 three major functional modules: a user interface, an 

integrator, and an I/O manager. Briefly, the user interface 
module interacts with the user to receive user queries for 
information, and to format and present information responses 
received from the network attached information sources, 
xo Advantageously,* the user interface i* adapted to the specific 
information domain being accessed. The integrator module 
accepts a user query from the user interface module, selects 
relevant information sources, formats it for network 
transmission to each relevant information source, receives 
20 responses from these sources, understands these responses, 
and passes the relevant portions of the responses back to the 
user interface module for display to the user. The I/O 
manager module performs hardware, operating system, and 
network specific interfacing for the user interface and 
25 integrator modules so that porting a netbot to different 
hardware platforms, operating systems, or networks requires 
changes only in well-modularized code. 

In particular alternative implementations, either or 
both of the user interface or the I/O manager modules may be 
30 absent. For example, the functions of these modules might be 
already performed by other operating system components. A 
netbot can be provide only one or more of the facilities 
disclosed without providing others. For example, if\ some 
embodiments, a netbot is useful that preforms query routing, 
35 alone or in combination with relevance ranking. In other 
embodiments, a netbot can simply format queries and 
understand responses. Further, additional modules may be 
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domain for which it is designed. For example, a netbot for 
comparison shopping in an information domain of electronic 
stores on the Internet can have a particular interface 
presentation containing labeled fields for product name, 
5 model, and price. On the other hand a netbot for access to an 
information domain of online Internet indexes can have an 
interface with fields labeled with elements derived from the 

query- 

in one netbot embodiment, most functions and modules 
10 reside on network attached servers which a user accesses 

remotely. For example, the user may access a netbot over the 
internet with the World Wide Web protocols utilizing a web 
browser, such as Netscape. In this case the user interface 
builds HTML formatted pages which are transmitted over the 
x* network by the I/O manager. Fig. 2A generally illustrates 
the user display from an example of such an embodiment, which 
is further directed to the information domain of onlme, 
electronic software stores. The netbot display of Fig. 2 A is 
divided into three sections. Section 11 is a title section 
20 generally indicating that this display has results from a 
netbot for shopping, a "ShopBot." A netbot preferably also 
has a specific input query screen. Section 12 presents the 
list of online stores currently being consulted represented 
by tneir WWW addresses 15, which are selectable to provide 
25 further information or direct WWW access. At 16 in section 
12 those sources which have already returned query results 
are similarly represented. Section 13 presents the results 
received so far formatted in accordance with this particular 
information domain into sections for the major PC operating 
30 systems. Each individual item returned, for example item 16 
is formatted with product name, price, and an address for the 
originating information source. In this implementation 
information display is controlled with the window scrolling 
and control facilities built into the web browser. This user 
35 interface is implemented as a HTML formatted page created 
a netbot server and transmitted to the web browser. 
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The incremental display strategy preferably provides one 
or more windows on the user's screen with several defined 
views of the query satisfaction process along with certain 
common user controls, such as screen buttons, for 
5 manipulating these windows* In one window, the user 

interface module presents lists of the information sources 
being consulted with each source symbolically represented as, 
for example, a network address, an icon, or another compact 
screen representation . Then, as an answer is received from a 

10 particular source, its associated screen representation 
changes appearance, e.g., by changing intensity, changing 
color, etc* Also displayed is a count of the total number of 
unique information items received currently. Optionally, 
clicking on the screen representation of an information 

15 source opens a further window with either information about 
thit information source, or a display of the responses 
received from it, or access to the information source over 
the network, etc. 

One of the common controls is preferably implemented as 

20 a common show-me-button that when activated causes display of 
another window in which a list of all currently received 
responses is presented ranked according to their estimated 
relevance to the query. Another common control is preferably 
implemented as a more-button in this latter window that wnen 

25 activated, causes re-display of all prior data items merged 
with items newly received since the last window display. The 
newly received items are merged into the display list in 
order or relevance, and distinguished from prior items by, 
for example, their color or intensity so that the user can 

30 avoid scanning prior items again. Optionally, clicking on a 
data item opens a still further informational window giving 
either the source of the item, a display of the response 
containing the item, or access over the network to f the source 
of the item, etc. 

35 In addition to such common functions and controls, a 

netbot user interface module preferably implements specific 
designs, formatting, and fields suitable to the information 
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user interface is implemented as a Java applet downloaded 
from a netbot server and executed by the web-browser. In 
this manner, the interface of Fig, 2B is capable of greater 
interactivity than that of Fig. 2A. For example, it can poll 
5 the netbot server for current search status and update the 
status displays accordingly without user action. 

Although the user interface is described primarily in 
terms of windows and buttons, one of skill in the art will 
recognize that this invention is adaptable to other display 
10 paradigms that provide for display of information and input 
of user commands. For example, the user interface module can 
control the entire screen and present graphical displays 
without intervention of a windowing system. 

The user interface module is preferentially implemented 
IS with ar. ooject rsrier.ted programing language supplemented 
with class library providing -windowing functions. A 
preferable implementation uses the Java language together 
with the java.awt package. See, for example, Flanagan, 1996, 
Java In A Nutshell . O'Reilly & Associates, sections 5 and 19. 



20 



THE INTEGRATOR MODULE 

Fig. 3 illustrates the preferred functional modules, 
data bases, and their functional interrelationship of netbot 
30 in general and of an integrator module 37 in particular. 
25 The integrator preferably consists of three functions: a 
query router 39, a wrapper database 40, and an aggregation 
engine 38. These components are introduced here and 
described in detail in the following. Given user query 31 
delivered from user interface module 34, the integrator first 
30 calls query router 39 to rank the network information sources 
known to a netbot in order of relevance and to return the N 
more relevant sources. Next the integrator retrieves the N 
wrappers for the N most relevant sources from wrapper , 
database 40. These wrappers, which are descriptions of the 
3S information source and its requirements, are written in the 
wrapper description language of this invention. The 
retrieved wrappers are used by aggregation engine 38, first, 
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Fig* 2B generally illustrates the user display from 
another example of a web-browser-based user interface 
embodiment r which in this case, is directed to an information 
domain of WWW indexes or search engines. This display is 
5 also generally divided into three sections. Section 71 
displays a title for the netbot; section 12 displays the 
status of the current search; and section 7 3 displays the 
search results. In more detail, the display of section 71 
includes a logo for this netbot, "MC" standing for 

10 "MetaCrawler , M a name chosen since WWW search engines are 
also known as "web crawlers," and controls to access certain 
system level presentation features, such the MetaCrawler home 
page and user feedback pages. The display of section 7 2 
includes list 74 of the search engines being queried 

l$ identified by their cordon n^mes , th° status of the current 
query in general and at each search enginn, and common user 
controls. Generally, pie-chart icon 78 summarizes that 7 of 
the 8 search engines queried have already responded to the 
query. At search engine 75, known as "Lycos," the check mark 

20 indicates that a response containing information items has 
already been received. At search engine 76, known as 
"Inktomi," the cross mar* indicates that a response without 
any information items has already been received. On the 
other hand, search engine 77, known as "Galaxy," is visibly 

25 distinguished from the other search engines to indicate that 
it has not yet responded to the query. The common controls 
of section 72 include more-button 79 to request the display 
of newly arrived search results, and modify-search-button 80 
to request a new or modified query be sent. Lastly, the 

30 display of section 73 includes the information items returned 
from the search engines. Each information item is displayed 
separately and includes title 81, descriptive text 82 if 
available, and line 83 with the URL of the web page for this 
item and the estimated relevance of this item to the query, 

35 here "1000." The items are sorted for display by descending 
values of the estimated relevance. The displayed items are 
scrolled using controls provided by the web browser. This 
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each query to the conceptual classes relevant to it, and 
finds information sources with conceptual classes shared by 
the query. The mapping of a query to its conceptual classes 
is preferably done with a hash function. 

5 

ft/znpre&TTqN ENGINE 

The aggregation engine is the coordinating function of 
the integrator module. It receives the user query from the 
user interface module and requests the query router to 
10 provide a list of the N information sources most relevant to 
the given query. Then it retrieves the N wrappers for the N 
information sources from the wrapper database. Guided by the 
N wrappers, the aggregation engine translates the query into 
the request formats accepted by each of the N information 
IS screes and transfers the K requests to the I/O manag-r foi 
network transmission. For some sources, the query may be in • 
the format of a form to be returned. When a response , 
received from an information source, the aggregation engine, 
again guided by the appropriate wrapper, extracts data from 
20 the response and places it into a list of data fields, called 
a tuple format, relevant to the particular information 
domain. Optionally, each tuple can be assigned a priority 
order using a method appropriate to the particular 
information domain. Finally when the incremental display 
25 manager requests data to present to the user, perhaps in 
response to a more-button request, the aggregation engine 
passes the tuples to the user interface module, sorted in 
priority order if a priority is determined 

For example, if the information domain relates to 
30 internet online software vendors, then the tuples optionally 
contain such relevant fields as product name, manufacturer, 
software version number, operating system required, price, 
etc. An exemplary priority order of the tuples can be, by 
price, by delivery delay, or other factor at user P"" ren ~; 
35 For a further example, if the information domain relates 

to world Wide Web ("WWW") search engines, which index 
information pages available generally on the WWW, then the 
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to format the query into forms recognized by each information 
source, and second, to understand the information returned by 
each information source 33 in order to eliminate extraneous 
formatting matter and to put the received information into a 
5 common format . This formatted information is then aggregated 
and passed to the user interface module for presentation 32 
to the user according to preferred incremental display method 
36. The user display is also controlled by stored user 
preferences 3 5 . 

10 

THE QUERY ROUTER 

A well-behaved netbot should preferably use scarce 
network bandwidth and information source processing resources 
in a competent and efficient and frugal manner, while at the 

ib same time besr. answering eacn user query* Such behavior 
minimizes resource usage, and thus achieves best overall 
performance, both for the individual netbot and for all 
netbots functioning simultaneously on a network. 

The query router is important to achieving this behavior 

20 because it permits the netbot to send requests only to 

information sources likely to have information relevant to a 
query. From a user query, the query router determines the 
relevance of each information source to the given query and 
returns the N most relevant sources. N is a parameter 

25 controllable at user preference and can be as small as 1. 
This relevance determination is preferably over inclusive 
rather than under inclusive. Occasionally including an 
irrelevant information source is preferable to missing a 
relevant and important source. Further, it is preferable 

30 that this relevance determination be quick to compute, not 
requiring costly processing techniques. 

In a preferred embodiment, the query router calculates a 
numerical relevance rank value for each information source 
that estimates the source's relevance. This calculation is 

35 based on the concept of "conceptual classes. 19 Thus each 
information source is tagged in advance with the conceptual 
classes for which it is relevant. Then the query router maps 
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be simply determined from the values of one or more numeric 
fields of the response tuples. 

THF. WRAPPER DATABASE 
5 The preferred manner of describing information sources 

and their capabilities, in particular their query formats and 
response formats, is with compact, modular, declarative 
descriptions called wrappers. Since a netbot can access from 
several hundred to many thousands of information sources, the 
10 descriptions of the sources are preferably compact, requiring 
a minimum of storage. Further, since new information sources 
are frequently created and existing sources frequently change 
their format, easy maintenance of source descriptions is 
important. A modular, declarative description, instead of a 
ib complex procedural description, lacilitates such maintenance, 
in one embodiment of this invention, wrappers can be learned 
by a separate module for information sources having 
sufficiently regular formats. 

For each information source, in an exemplary embodiment, 
2 0 each wrapper advantageously includes the following 
information: 

1. The Universal Resource Locator ("URL") address of the 
information source; 

2. The conceptual classes of the source; 

25 3 A description of the mapping from query arguments, e.g. 

words or phrases, to fields of the query or HTML defined 
form used to interrogate the source (including site 
support for any, all, phrase, or proximity queries); 
4. a description of the format of the query response or 
30 HTML page layout that enables parsing of relevant 

information from other information and extraneous 
formatting matter. 
At least items 3 and 4 are advantageously written in the 
wrapper description language of this invention. 
35 A netbot can retrieve wrappers in various manners. In 

one embodiment, the information source itself can supply its 
own wrapper upon request from a netbot. In an alternative 
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tuples optionally contain such relevant fields as title of 
the indexed page, the universal resource locator ("URL") of 
that page, relevance scores estimating the relation of the 
indexed page to the query, descriptive text, etc. An 
5 exemplary priority order can be based on the netbot's 

normalized estimate of the relevance of the indexed page to 
the query. If the netbot does not retrieve the indexed page, 
then it sums the normalized relevance estimates for this 
response that are returned from each of the search engines. 
10 If a search engine does not return a relevance estimate, a 
default value is used. The obtained relevance estimates are 
then normalized by linearly adjusting the returned scores to 
have a common maximum of, e.g., 1000, and then multiplying 
the adjusted scores by a confidence factor. This confidence 
15 factor, ranging from 0 to 1, is a predecermined estisats or 
the reliability of a particular information sources own 
relevance estimates. For example, it can be determined by 
practical experience with the information source's relevance 
estimates. Alternately, the user can request the netbot to 
20 retrieve the page in order to do its own relevance estimate. 
In an exemplary embodiment, for queries requesting the 
presence either of all query words or of any query words, the 
estimate is determined by scanning the page and counting the 
number of query words actually present, and then scaling the 
25 count so that the presence of all words results in the common 
maximum relevance value. For queries requesting the presence 
of a phrase, the estimate is determined, for example, by 
subtracting from the common maximum a normalized sum of the 
square of the distance in the page of each word of the phrase 
30 from its successor word in the phrase. Thereby, if the 
phrase appears contiguously in the page the relevance is 
high, whereas if the words of the phrase are widely separated 
on the page, the relevance is low. 

In summary, for the majority of information domains, the 
35 priority order is determined from a relevance computation, as 
in the WWW search engine example. However, for certain 
domains such as online software vendors, a priority order can 
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includes three fields of relevance, denoted by (stuff) and 
referred to sequentially by $0, $1, and $2, that when an item 
is recognized are output by the "output ()" statement. In 
detail, an item is specified to consist of stuff, then the 
5 string <dt>, then more stuff, then the string '="', then the 
first field of interest later referred to by $0, then the 
string ' "><strong>^ , then the second field of interest later 
referred to by $1, then the string "</strongx/a>'\ then more 
stuff, then the string "<dd>", then the third field of 
10 interest later referred to by $2, then the string "<br>". 
The action statement within braces, in this case specifying 
output of the variables $0, $1, and $2 with the bindings when 
an <item> is matched, is executed when each item is 
recognized. 

IS 

THE NETBOT SYSTEM 

The preferred functional structure of a netbot can be 
assigned to system hardware components in various 
alternatives. The preferred alternative in any case depends 
2 0 on which allocation of function achieves a rapid response and 
reasonable cost. Fig. 4 generally illustrates exemplary 
netbot hardware embodiments and options in view of the 
previous general description. It illustrates the 
interrelationship of user computer elements 51-56, network 
25 57, information sources 58, and netbot server computers 59- 
61. Computer 51 is a user computer including a processor, 
memory, and various attached peripherals. Such peripherals 
include display device 52, or other device for user 
interaction, network attachment 54, optional hard disk 
30 storage 53, and so forth. Computer 51 can be alternatively a 
network device without permanent storage, a PC, a 
workstation, or more powerful computer. It is preferred that 
computer 51 be a PC or a workstation running one of the 
Windows operating systems, the Macintosh operating system, or 
35 UNIX. Present in the memory of user computer 51 is, among 
other software, local netbot software 55 and local system 
components 56. The local netbot software implements one of 
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embodiment, the netbot can provide its own wrappers in 
various manners. For example, wrappers can be built in the 
netbot itself, especially where the netbot accesses only a 
few information sources. Also, wrappers can be stored in a 
S local database or can be downloaded on demand from a 
centralized database. 

The wrapper description language (hereinafter referred 
to as the "WDL") of this invention facilitates the semantic 
description of queries, forms, and pages by using a 

10 declarative description format that combines feature from 
grammars and regular expressions. Here an example of this 
description language is presented. A detailed description is 
set forth in a later section. Syntax used follows 
conventions known to those of skill in the art for specifying 

15 grammars, including regular expressions. See, e.g., 

Schwartz, 1993, Learning Perl , O'Reilly & Associates, Inc., 
chapter 7; Aho et al., 1986, Compilers Principles , 
Techniques, and Tools . Addison Wesley Publishing Co., section 
3.3 

20 An exemplary description in WDL of a typical page 

returned from an WWW search engine follows here. The WDL 
interpreter uses the- page* description to parse a page and to 
execute any specified action statements. Note that "stuff" 
is a reserved word in the WDL that matches any character 

25 string up to the first occurrence of a mandatory following 
string literal* 

<page>::- stuff "<dl>" <item>* .*$ 

<item>::= stuff "<dt>" stuff (stuff) "\»><strong>" 

30 (stuff) "</strongx/a>" stuff M <dd>" (stuff) 

••<br>" { output ($0, $1, $2, 500) } 

This describes a page made up of, among other data,, a 
sequence of zero or more items. In detail, a page is 
35 specified to consist of stuff, then the string <dl>, then 
zero or more items, then zero or more characters ("."), then 
finally the end of the page ("$"). In general, an item 
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aggregation engine, which makes query routing requests to 
query server 59 and wrapper requests to wrapper server 60. 
Further, it can include one or both of these latter 
functions . 

5 The various computers of a netbot system can be provided 

with software for performing the methods of this invention 
either from computer readable media or by loading across a 
network. This invention is adaptable to known magnetic and 
optic media, such as disks, tapes and CD-ROM. 

10 

5.2. TWTt T fO MAKAGER 
The I/O manager module performs hardware, operating 
system, and network specific interfacing for the integrator 
module. Network interfacing includes the tasks of sending 
15 re~uests and receiving responses from network attached 

information sources. An important application of the netbots 
- of this invention is to information retrieval over the 

internet. In this application the I/O manager is responsible 
for implementing the relevant protocols of the WWW, Gopher, 
20 FTP, internet tools, etc. Optionally, it can temporarily 

cache pages and other data in order to improve response time. 

operating system interfacing includes the task of window 
management for the user interface module and access to the 
wrapper database, if present. 

Preferably, the I/O manager is constructed from 
commercially available protocol stacks, windowing libraries, 
S uch as the Java.awt package, and other tools, in some 
implementations, more or less of the I/O manager functions 
can be performed by other system components on the network 
30 attached computer. Optionally, the I/O manager is designed 
to be scalable to multiple machines, to not require multi- 
threaded or reentrant code, and to be cross platform and 
persistent. 



25 



35 
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more of the netbot functions. The local system components 
can include, for example, a web browser. 

Network 57 can be any network with a plurality of 
attached information sources 58, which are can be optionally 
5 conceptually classified by subject matter into information 
domains. In a preferred embodiment, network 57 is the public 
Internet or a private intranet supporting the TCP/IP suite of 
protocols, including such user level protocols as FTP, HTTP, 
and so forth. The information sources are server computers 
10 which make their stored information available using the 
protocols supported by network 57. Such information can 
include databases of periodicals, newspapers, etc., 
information on or produced by particular commercial, 
educational, or other types of organization, facilities for 
lb electronic commerce, etc. 

In such a network, a netbot can have various 
embodiments. In an entirely local embodiment, all netbot 
functions reside in local netbot software 55 on user computer 
51, which in this embodiment must have sufficient processing 
20 and storage capabilities. In alternative embodiments, one or 
more of the disclosed netbot functions can be distributed on 
other network attached computers. 

For example, computer 59 is a wrapper server for 
accepting requests for downloading wrappers from its wrapper 
25 database. The wrapper database can be stored in memory or on 
disk using any data management system capable of storing and 
retrieving compact textual descriptions. Computer 60 is a 
query server for performing query routing by accepting 
queries and returning the N most relevant information sources 
30 from the many tens or hundreds of thousands about which it 
stores information. Computer 61 is a netbot server for 
performing the integrator module function by accepting user 
queries and returning search results, perhaps using the 
facilities of wrapper server 59 and query server 60. With 
35 these network servers, local netbot software preferably only 
supports the user interface, which may be delegated entirely 
to a web browser. Alternately, it can further include the 
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When multiple information sources are queried, it is 
preferable in step 6 to present to the user interface module, 
and thereby to the user, a single merged list of tuples 
extracted from the responses and sorted according to an 
5 estimate of the significance or relevance of each tuple to 
the user- Such a estimate is preferably made according to 
methods specific to the information domain to which the 
netbot is directed- For certain domains, a significance 
estimate can be made directly from the value of one or more 

10 data fields in the tuples. For example, in a domain of 
electronic shopping, significance can be related only to 
price or delivery date, according to user preference- For 
most domains, however, a significance estimate is made 
according to relevance of the returned information, which 

15 must be determined by examining the responses from each 
sources. 

In a preferred embodiment of such a relevance 
determination, the user has the choice of whether or not to 
have the netbot examine all information pages itself. If the 
20 user so chooses, the relevance is determined by the netbot 
according to a domain specific Analyze function. In a domain 
of information queries, an exemplary Analyze function finds 
the number and location of query words in the returned 
response. For keyword queries, responses with more keywords 
25 present with greater frequency are more relevant- For phrase 
based queries, responses having the words of the phrase more 
closely spaced, for example in one sentence or even in 
contiguous sequence, are more relevant- In other domains, 
appropriate Analyze functions are provided. 
30 If the user chooses not to have the netbot examine the 

responses, the netbot relies on relevance estimates returned 
from the information sources. If a particular source does 
not return a relevance estimate, a default value is used. 
These estimates are then normalized to be between, e.g., 0 
35 and 1000, and multiplied by a confidence ranking factor - 
This confidence factor, ranging from 0 to 1, is a 
predetermined estimate of the reliability of a particular 
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5.3. THE AGGREGATION ENGINE 

In the preferred embodiment the functions previously 
identified for the aggregation engine component of this 
invention are performed by the following process. Searches 
5 of the information source proceed in parallel because all 
requests are transmitted without waiting for any responses. 

1. Receive a user query from che user interface module; 

2. Perform query routing to determine the N information 
10 sources most relevant to the user query; 

3. For each of the N relevant information sources, do: 
A- Retrieve the wrapper for this information source 
(for example, from the source itself or from a local or 
remote wrapper database) ; 

15 Guided by the wrapper, format the user query into 

the form or format required by the information source; 
C. Transfer the translated command to the I/O module 
for transmission to the information source; 

4. Initialize the list of responses to be empty; 

20 5. Until a user specified time limit is reached, do: 
A. When an information source response has been 
received by the I/O module and transferred to the 
integrator, then: 

i. Guided by the wrapper for this information 
25 source, parse the response to understand the 

information returned, discard the site-specific 
formatting text and other irrelevant matter, and 
gather relevant fields into tuples; 

ii. Add each tuple to the list of tuples, 

30 optionally performing priority ranking, duplicate 

elimination, etc. ; 
B. Wait for the next response; 
6. Deliver the list of created tuples to the user interface 
module on request, which can be due, for example, to 
35 user activation of the show-me-button or more-button 

controls . 
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Input: List of K sources with their confidence rank pairs 
(source_k, crankjc) , obtained from the query 
routing system; K ordered lists of tuples, 
tuples_k, of length lengthjc, obtained from source 
5 source_k; and the verification flag V (Boolean) , 

obtained from user preferences 
Output: Merged list of all tuples sorted by relevance. 
/* Main routine */ 
IF (V is true) THEN 
10 FOREACH k « 1..K 

FOREACH tuple t in tuplesjc 

page = the HTML page that tuple t points to, 

downloaded if necessary 
t. score = Analyze (page) ; 
IS ELSE /* V is false -/ 
FOREACH k = 1 . . K 

Normal izeScores (tuples_k) 
AdjustByHeight (tuples Jc) 
AdjustByServiceRanking (tuples_k) 
20 /* Merge result tuples, t, into MERGED^ LIST and 

determine a composite relevance score, t. score, from the 
scores returned by the information sources, 
t.sourcescorejc; the same tuple returned from multiple 
sources has its composite score incremented by the 
25 source score from each source */ 

FOREACH k = 1. -K 

FOREACH tuple t in tuples_k 

IF t is not in the MERGED_LIST THEN 
Add t to the MERGED_LIST 
30 t. score = t.sourcescore_k 



35 
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information source's own relevance estimates. For example, 
it can be determined by practical experience with the 
source's estimates. Where the same tuple is returned from 
two or more sources, the relevance values from all those 
5 sources are combined. Optionally, the relevance estimates 
returned from each source are adjusted to have a uniform 
distribution in the normalization range. 

In a particular detailed embodiment, this determination 
is performed according to the following process. Here, query 

10 routing has determined a list of K information sources, 
source_k with k from 1 to K, and returned their confidence 
ranks, crank_k. Each of these sources has been queried, 
returned responses, and K lists of information tuples, 
tuples_j where j is from 1 to length_k, have been extracted 

15 from these responses. The user's preference for netbot 
analysis is recorded in verification flag V. The variable 
t. score represents the composite relevance score for tuple t; 
the variable t. sourcescore_k represents the relevance 
estimate returned from information source_k for the response 

20 that tuple t was extracted from. 
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percent_step = 100 / Lengthk; 

percent_off = 0; 

FOREACH tuple t in tuples_k 

t.sourcescore_k = t . sourcescore_k * (100 - 
5 percent_off) / 100; 

percent_off « percent_off + percent_step; 

ENDSUB 

SUBROUTINE Ad justByServiceRanking (tuples_k) 

/* Each Service is assigned a percentile ranking indicating 
10 the confidence given to its returned source scores; this 

ranking is used to scale the returned source scores for each 

tuple; for example, a 90% confidence ranking means that each 

source score is scaled by 0.9 */ 
FOREACH tuple t in tuples_k 
;5 t . sourcescore_k = t . sourcescors_k * crank_k 

ENDSUB 

One of skill in the art will recognize that these 
processes are amenable to routine alterations and 

20 enhancements that perform the same functions in the same 
manners. In particular other values for the normalization 
range and default value for the information source relevance 
estimates can be used. This invention includes such routine 
alterations. These processes are preferably implemented in 

25 C++, but can alternatively be implemented in any procedural 
or object-oriented programming language. 

5.4. T qB OUBRY ROUTER 
The query router receives as input a user query 
3 0 expressed as a list of words or keywords and returns as 
output a list of N information sources ordered by their 
likely relevance to the input query. Determination of these 
information sources is optimized for speed and over 
inclusiveness . Occasionally including an irrelevant 
35 information source is preferably to omitting a relevant 
source. 
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ELSE 

t. score = t. score +■ t . sourcescore_k 

END IF 

ENDIF 

5 SORT all tuples t by t. score and discard duplicates 
OUTPUT sorted tuples 

EXIT /* finished relevance ranking */ 
/* Subroutines */ 

SUBROUTINE Norma lizeScores (tuples_k) 
XO /* If information source Jc returns relevance estimates, 

normalize them to fall in the range from 0 to 1000; 

otherwise, use a default relevance estimate */ 

/* "s" is the k'th information source's relevance score 
for the first tuple on the list of tuples from source^ 

15 * / 

s = tuples_k[l] .sourceseore_k 

IF (s == 0) THEN 

/* this information source returns no scores, 
therefore use default */ 
20 FOREACH tuple t in tuplesjc 

t*sourcescore_k = 1000 

ELSE 

scaling_f actor « 1000.0 / s; 
FOREACK tuple t in tuples_k 
25 t.sourcescorejc - t . sourcescorejt * 

scaling_f actor ; 

ENDIF 
ENDSUB 

SUBROUTINE Ad justByHeight ( tuplesjc) 
30 /* Adjust the source scores to have a uniform percentile 
distribution; for example for 10 tuples, the first tuple's 
source score is adjusted to 100% of its source score, the 
second tuple's source score is adjusted to 90% of its souro 
score, etc. */ 

35 
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10 



15 



The preferred implementation performs the following 
process. 

1. For each of the s information sources, set RELEVANCE [ j ] 
= DEFAULT-RELEVANCE [j ] ; 

2. For each word in the user query, do: 

A. Compute a hash function on the word obtaining a 
number, M, between zero and W. Any suitable hash 
functions is adaptable to this process. An exemplary 
hash function is found in Sedgewick, 1990, Algorithms in 
C Addison-Wesley Publishing Co., chapter 16. 
B '. Let the C bit vector V equal to W0RD2C0NCEPT[M] ; 
C. Combine the relevances for all the concepts in V to 
the relevance for the information sources by performing: 
For i from 0 to C, do: 
If (i-th bit of V is "l') THEN 
FOR j = 1 to S DO 
RELEVANCE [ j ] = RELEVANCE ( j ] + 
CONCEPT2 SOURCE [ i , j } 
20 Monotonically increasing function other than «+« can 

also be used to combine the individual concept 
relevances into a final relevance; 

D. combine the relevances for all the words in the 
user query, for example by adding them together. 
25 3. in the case of searching by phrases, additionally do: 

A. Concatenate all the words of the user query phrase; 

B. Compute the hash function on the phrase and 
obtaining M, and set the c bit vector v equal 
WORD2CONCEPT[M] ; . 

30 c. Combine the relevances for all the concepts in V to 

the relevance for the information sources by performing: 
For i from 0 to C, do: 
If (i-th bit of V is '1') THEN 
FOR 3=1 to S DO 

RELEVANCE [ j ] = CONCEPT 2 PHRASE [ i,j ] 
Sort information sources based on their RELEVANCE , and 
return the N most relevant sources. 
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The preferred query router is based on the principle of 
assigning relevant concepts to information sources and query 
words. in advance, a set of concepts is chosen to describe 
the information sources of the one or more information 

5 domains to which one or more netbots are directed. For each 
information source in the domains, the relevance of that 
source to each of the chosen concepts is judged. Further, 
each word that can appear in possible queries is examined to 
determine which of the chosen concepts are relevant to the 
10 word. Then, upon receiving the words or keywords of a query, 
the concepts associated with these words are determined, and 
then the information sources relevant to these concepts are 
found. The ranked relevance of each source is determined by 
combining the individual relevances of the source to all the 

15 concepts o£ the query. The cass of phrase based queries is 
preferably handled by generating separate data for this query 
type. 

The preferred implementation of this process utilizes 
four tables containing relevance information. In the 
20 following, w is chosen to be somewhat bigger, e.g. 10%, than 
the number of words that can appear in possible queries; c is 
the number of chosen concepts; and S is the number of 
information sources in the information domain. 
WORD2CONCEPT[] is a table of W vectors of C bits, where the C 
25 bits of the vector for a word indicate which of the C 

concepts are relevant to that word. CONCEPT2 SOURCE [ ] [ ] is a 
C by S table. For each of the C concepts and S sources, the 
corresponding entry of this table contains the relevance 
value of that source to that concept. For example, if entry 
30 <i, j> equals 5, the j-th information source has a relevance 
weight of 5 with respect to the i-th concept. 
CONCEPT2 SOURCE [] [] is used when searching by words. For 
searching by phrases, the table CONCEPT2 PHRASE [ ] [ \ similarly 
relates concepts to sources. Finally, DEFAULT-RELEVANCE [ ] 
35 has a default relevance weight for each of the S information 
sources . 



- 27 - 



WO 98/12881 PCT/US97/17132 



5.5. THE WRAPPER DEFINITION LANGUAGE 

This section first presents introductory material on the 
use of the WDL in wrappers. Next, the two principal 
5 components of the WDL - the action language and the regular 
expression language - are described in detail. Finally, an 
exemplary embodiment of the WDL is presented. 

A wrapper is a description of an information source and 
how a Netbot should interact with it, in particular how to 
10 format requests to it and how to understand responses from 
it. NetBots need to interact efficiently with hundreds to 
many thousands of information sites on a network. This 
interaction presents two requirements: first, compact storage 
of a description of an information source using a 
IS representation encoded in the WDL; and second, use of tnis 
description to understand the information source, currently, 
for example, netbots need to format requests to a source and 
to parse useful information from the pages returned by the 
source while ignoring irrelevant formatting information. As 
20 information sources become more functional, netbots will need 
to process interactions more complex than simple request- 
response pairs. Therefore, wrappers and the WDL preferably 
have the following features: 

1) WDL descriptions are easy to write, easier than, for 

25 example, in C++. This is important because new information 
sources are created frequently and existing services change 
their format frequently. Optionally, wrapper descriptions 
can be automatically generated using machine learning 
techniques in information domains where responses have a 

30 regular, predetermined format. 

2) Wrapper description are small. This is important so 
they can, for example, be stored locally in a database or 
quickly transmitted from a server to a netbot running in a 
client, even over a slow network connection. Optionally, 

35 information sources can supply their own wrappers on request, 
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One of skill in the art will recognize that this process is 
amenable to routine alterations and enhancements that 
performs the same functions in the same manner. This 
invention includes such routine alterations. 
5 This process is preferably implemented in C++, but can 

alternatively be implemented in any procedural or object- 
oriented programming language. In the case of a query router 
which maintains information on a large number, e.g., tens of 
thousands, of information sources, the query router is 

10 preferably implemented as a server process on a server 
computer to accommodate the size of the required data 
structures and the processing requirements of query routing. 

An exemplary construction of the table WORD2 CONCEPT 
begins with the selection of concepts to characterize the 

2.5 information domain or interest and the determination of words 
and phrases likely to occur in user queries. For each 
concept, the following actions are performed. The words and 
phrases associated with that concept or to which the concept 
relates are assigned to the string arrays KEYS [ ] and 

20 PHRASESf]. Then the following process is carried out. 

FOR i equals 1 to the number of elements in KEYS[], DO 
Apply the previously used hash function to KEYS[i] 
to obtain a number between 0 and W 
25 SET the bit matching the current concept in 

W0RD2C0NCEPT[M] . 
FOR j equals 1 to the number of elements in PHRASES[], 
DO 

Apply the previously used hash function to KEYS [ i ] 
30 to obtain a number between 0 and W 

SET the bit matching the current concept in 
WORD 2 CONCEPT [M] . 

These actions are repeated for each chosen concept. 
35 Alternatively, concept information can be stored along with 
the string information, using either open or closed hashing, 
in order to preserve accurate string to concept matching. 
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Second, the preferred implementation compiles regular 
expressions into compact and efficient finite state automata. 
Third, it encourages the efficient and intuitive expression 
of complex regular expression in a nested manner. And 
5 fourth, it possesses an efficient backtrack-free padding 
facility. 

An exemplary wrapper follows which is suitable for an 
internet information search engine and is written in the WDL. 
The significance of each portion is explained in comments, 
10 which are surrounded by "/* M and ••*/." 

/* A list of input query words is passed to the wrapper 
from the aggregation engine, which is executing the 
wrapper for the information source. The argv() function 
15 of the action language extracts a li«t-. of the input 

query words. */ 
{$keywords = argv(2); 

/* The request to be sent to the information sou *- cc is 
20 calculated by concatenating, indicated by the 

operator of the action language, three separate strings: 
one, the Internet URL address of information source and 
the initial query format string; two, the query words to 
search; and three, the remainder of the query format 
25 string for this information source; and */ 

$url = "http:/ /searcher. source.com/searcher. cgi?query 

. $keywords . "&onlyrr=0" ; 

/* The fetch action statement transfers this query the 
30 I/O module for network transmission, and then waits for 

the HTML formatted response. */ 
$page = fetch(0, $url,"); 

s 

/* The HTML formatted response text is parsed using the 
35 following regular expression grammar. */> 

$result = parse ($page, <page>) ; } 
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3) WDL descriptions can be automatically compiled into fast 
finite state automata that quickly parse the information 
returned by information sources. 

4) Using wrappers with WDL permits netbots to adapt to new 
5 types of information formats and new types of information 

server interactions in the future . 

The wrapper description preferably specifies at least 
two processes: first, requesting information from an 
information source, e.g., how to fetch the appropriate HTML 

10 formatted page from a source; and second, how to parse 

returned pages to extract the relevant data. To perform the 
first process, the WDL includes an action language component, 
which is an extensible language of expressions and 
statements. To perform the second process, the WDL includes 

15 a regular expression language component, v/hich is an 

extensive and novel means of specifying regular expression 
pattern matching facilities. 

In alternative implementations, netbots can utilize 
alternative pattern matching facilities known to those of 

20 skill in the art. For example, the regular expression 
component could be replaced with a context free language 
specification ( H CFL ,r ) . In this case, implementation of the 
WDL can follow techniques known for construction of compiler 
compilers, for example YACC. However, where possible, 

25 regular expression pattern matching is preferred because of 
its straightforward specification and rapid execution. 

An Example Of The Regular Express ion Component 

Regular expressions are advantageous for describing the 

30 format of the information returned from many information 
sources. The regular expression component of the WDL 
augments prior regular expression matching facilities in 
several novel manners. First, it permits programming 
language facilities of the action language, e.g. statements 

35 and expressions, to be executed when regular expressions are 
recognized and with variable bindings as determined by 
partial matches recognized during the overall recognition. 
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/* An response from this exemplary source consists of a 
page containing zero or more information items. This is 
hierarchically described in the regular expression 
language by expressions for <page> and <item>, where 
<page> refers to <item>s. In particular, a page 
includes a chunk of text followed by the string "results 
returned . H , then followed by zero or more items . */ 
<page> stuff -results returned, ranked by' 

<item> * END 

/* Each item consists of irrelevant text, HTML 
formatting codes, and relevant data fields. The 
relevant data fields are enclosed in parentheses and 
referred to sequentially by $0, $1, and $2. These 
fields, along with the number 500, are output as a tuple 
when the " output ,r statement of the action language is 
executed upon recognition of an <item>. The particular 
meaning of the definition of <item> will become apparent 
in the next section. */ 

<item> stuff '<hr>' stuff '<center><bxa href="' 

(stuff) stuff '>' (stuff) '</*>' 

stuff '<font size=-l>' (stuff) '</font>' 
{ output ($0, $1, $2, 500); } END 



5.5.1. DESCRIPTION OF THE WDL COMPONENTS 

This section describes the preferred features of the 
action language and the regular expression components of the 
WDL. This invention is not limited to the descriptions 

3 0 presented. These descriptions make use of known methods for 
describing grammars, including regular expressions, and one 
of skill in the art will recognize that there are many 
substantially equivalent descriptions. Such equivalent 
descriptions include, but are not limited to, those resulting 

35 from renaming the described syntactic elements or from 

applying known grammatical transformations to the presented 
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syntax* This invention comprehends also these substantially 
equivalent descriptions. 

The Action Language Component 
5 The action language includes certain preferred base- 

level features: an assignment statement; sequencing 
statements such as "if" and "while" constructs; a tuple 
output statement; string and numeric expressions; string, 
numeric , and boolean operators; and certain built-in 

10 functions. It will be apparent to those of skill in the art 
that a base- level action language of equivalent semantic 
expressive ability can be constructed with slightly different 
choices of features. For example , the while statement can be 
replaced by a goto statement* The action language component 

15 of this invention comprenends such known equivalent 
formulations of the semantics disclosed. Further, in 
optional embodiments, the preferred base-level features can 
be augmented with such additional features as: additional 
sequencing statements such as "for" or "repeat" loops; user 

2 0 defined functions; additional string, numeric, and boolean 
expressions and operators such as are found in C or C++; 
additional built-in functions; and array variables. Such 
additional features can be added to the preferred base-level 
language in known manners. See, e.g., Aho et al., 1986, 

2 5 Compilers Principles, Techniques, and Tools , Addison Wesley 
Publishing Co. 

The syntax of the preferred base-level action language 
is given by the following grammar expressed in a standard 
notation. 



35 



- 34 - 



WO 98/12881 



PCT/US97/17132 



Statement 



5 



$VariableName "=" Expression " ; " 

j "if" Expression Statement [ "else" Statement ] 

| "while" Expression Statement 

| "{" Statement* "}" 



if 



output" " (" [ Expression { 



it 



, "Expression }* ] ") 



ii • it 



Expression : := 



10 



string-constant 

j float-constant 

| $variableName 

! Expression op Expression 

j "(" Expression ")" 

1 functionName "(" [ Expression { 



"/'Expression }* ] ") 



it 



15 



Thus, the action language component is defined in terms 
of statements and expressions. A statement can be: an 
assignment statement , which assigns a new value to a scalar 
variable; an if statement; a while loop; a compound 

20 statement , which is a sequence of other statements enclosed 
by "{" and "}"; or an output statement. Except for the 
output statement , the statements function in the same manner 
as in other procedural programming languages, e.g., C, C++, 
or Pascal. For the if and while statements, the conditional 

25 argument is considered true if it is either a non-zero number 
or a non-null string, i.e., not the empty string HM . The 
OUTPUT statement allows a wrapper to return information 
matched in responses froir the information source to the 
netbot module executing the wrapper. For example, executing 

30 the statement "output (arg_l, arg_n) " causes the tuple 

<arg_l , . . . , arg_n> to be returned from the wrapper to the 
netbot . 

An expression can be: a string constant, which i£ a 
symbol string surrounded by quotation marks; a floating point 
35 number; a variable name, which is a name or an integer 

preceded by a dollar sign; an infix operator applied to two 



- 35 - 



WO 98/12881 



PCTYUS97/17132 



surrounding sub-expressions; a sub-expression, which is 
enclosed in and ")"; or a call to a built-in function. 

Further, the language provides operators standard in 
procedural programming languages, including the following: 
5 arithmetic operators ("+", "/") ; numeric comparison 

operators ("<", ">", »<=", »>=», "!=»); string 

comparison operators ("It", "eq" , "gt", " le" , "ge" , "ne"); a 
string concatenation operator ("•"); and boolean operators 
("&&", "jl", "!"). These operators have the following 
10 semantics: 

1) + f *, / : perform the indicated arithmetic operation 
on the surrounding expressions; 

2) . : concatenate the surrounding strings; 

3) — , <=, >=, <, >, != : perform the indicated numeric 
15 comparison of the surrounding numbers and return tnc 

floating point values 0.0 or 1.0 accordingly; 

4) eq, le, ge, It, gt, ne : perform the indicated 
character-by-character comparison of the ASCII codes of 
the surrounding strings and return the floating point 

2 0 values 0.0 or 1.0 accordingly; 

5) ! : return 1 if argument is the number 0 or the string 

, otherwise return 0; 

6) && : evaluate the first argument, if the first argument 
is zero or stop and return that value, else 

25 evaluate the second argument and return the latter 

value; 

7) | | : evaluate the first argument, if first argument is 
neither zero nor 11,1 , stop and return that value 
immediately; else evaluate the second argument and 

30 return the latter value. 

These operators have the following precedence in order from 
highest to lowest: 

1. any expression inside parentheses (highest precedence) 

2. *, / 

35 3. +, . 

4* ==/ <=/ >==/ <, >, ! = , ne, eq, le, It, gt, ge 
5. ! (boolean not) 
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6. && (boolean and) 

7. || (boolean or) (lowest precedence) 
All operators are left-associative. 

No variable declarations are needed. First, all 
5 variables in the action language are distinguished by being 
preceded by a "$". Those special variable denoting sub- 
strings matched by regular expression consist of a 
followed by an integer, e.g., $0 or $1. Second, at run time, 
automatic type conversions, between string and floating point 

10 types, is done dynamically. If an operator expects a number 
but gets a string, the string is converted to a number by 
calling the C library function atof(), which converts the 
ASCII representation of a number into its internal 
floating-point representation. If an operator expects a 

15 string argument but gets a number, it uses the C library 

function sprintf ( . . . , "%f ") to convert it to a string- Third, 
referenced variables that have not yet been assigned are set 
the default values 0 or as appropriate. 

The action language component has several built-in 

20 functions. The following are preferred base-level built-in 
functions . 

1) argc(), argv() : when a netbot executes a wrapper, 
it can pass the wrapper one or more arguments. These usually 
represent query parameters, query words, or query keywords 

25 supplied by the user. These arguments are accessed from 
within the wrapper language by the functions argc ( ) ; which 
returns the number of arguments passed to the wrapper , and 
argv(n), which returns the n-th argument, 

2) fetch(): This function preferably interfaces with 
30 the I/O module to transfer a string containing the network 

address of an information source and perhaps query parameters 
over a network to the addressed information source according 
to the proper protocol, and returns a string containing the 
response of the information source. Wrappers use this 
35 function to query information sources and retrieve pages. 

3) parse (<string>,<nonterminal>) : This function takes 
a string and attempts to match it to the regular expression 
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corresponding to the given <nonterminal> , as defined by the 
regular expression language component of the WDL. This 
function returns M i M if the match was successful, or 11 o" if 
the string did not match the regular expression. Wrappers 
5 use this function to parse the response of an information 
source. 

An exemplary wrapper includes the following sequence of 
action language statements. First is a series of statements 
using argc() and argv() to obtain the user query parameters 

10 and to initialize a string variable, e.g., $url, with a 
string containing the URL of the appropriate page at the 
information source together with an appropriately formatted 
query- Next, the assignment statement "$html_text = fetch 
($url) H fetches the query-response page into another string 

i» variable, e.g., $html_text . Finally, thp function 

parse ($html_text , <page>) attempts to matcn the returned html 
text with the regular expression, <page>, which describes the 
sought page, 

20 The Regular Expression Language 

The regular expression ("reg-exp") component of the WDL 
matches strings against regular expressions. It has been 
found that regular expressions are convenient to describe the 
format of responses returned from a wide range of information 

25 sources. However, the reg-exp language of this invention is 
capable of matching this information so that relevant fields 
can be extracted in a more rapid and more convenient manner 
than can prior languages and systems, such as AWK or Perl. 
The regular expression component includes novel facilities 

3 0 that solve the problems with prior matching systems, and 
thereby, permit its use by netbots. 

A first novel facility allows the specification of 
regular expressions to be broken into pieces in a manner 
similar to a context-free grammar, which specifies a language 

35 by a set of rules for nonterminals in the grammar. See, 

e.g., Aho et al., 1986, Compilers Principles, Techniques, and 
Tools , Addison Wesley Publishing Co., section 4.2. Writing a 



- 38 - 



WO 98/1288! PCT/US97/17132 



single regular expression to represent the format of a page 
from an information source, as is required in prior systems, 
often results in a very large and cumbersome expression, one 
which is difficult to write, understand, and maintain* To 
5 solve this problem of existing systems, the reg-exp component 
specifies a regular expression by a set of rules, for 
components of the regular expression. These components are 
labeled by nonterminals. However, in contrast to 
context-free grammars, the set of rules in the reg-exp 

10 component are not allowed to be recursive or 

mutually-recursive. In other words, the rule for a 
particular nonterminal cannot directly or indirectly refer to 
other rules which refer to that particular nonterminal. 

The following exemplifies the use of a set of rules and 

15 nonterminals. A top-level nonterminal defining an 
information response can be: 

<page> ::= <head> <item>* <tail> END. 
which specifies that the response is a page consisting of a 
head, followed by zero or more items, followed by a tail. 

2 0 The keyword "END" denotes the end of a rule. The second- 
level nonterminals on the right-hand side ("RHS") of this 
rule, <head>, <item>, and <tail> are defined by their own 
rules: 

<head> ::= "Results of your search: \n" END; 
25 <item> ::= "Data: . . . \n" END; 

<tail> ::«= "No more results\n" END. 
To execute these rules, the reg-exp component compiler 
substitutes into the RHH of <page> the RHSs of the rules for 
<head>, <item>, and <tail>- The result is as if the wrapper 
30 contained the large, cumbersome composite top-level rule: 

<page> ::= "Results of your search: \n" ("Data: 
. ..\n M )* "No more results\n" END 
If the second-level rules had contained further nonterminals 
on their RHSs, the compiler would continue making appropriate 
35 substitutions until there are no more nonterminals on the RHS 
of the composite rule for the top-level rule. Because of the 
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lack of recursion or mutual recursion, this substitution 
terminates . 

A second novel facility permits skipping groups of 
characters in a string without backtracking. In prior 
5 regular expression matching systems, this quite common 
requirement is implemented in a manner that requires the 
storage of many backtracking points on a stack and can lead 
to extensive backtracking. For example, the prior art Perl 
idiom ".*", which matches any number of occurrences of any 
10 character, or the Perl idiom "r\d]*\d, which matches all 
non-digit characters up to the first occurrence of a digit, 
can cause extensive back-tracking during a match. This is 
inefficient and not preferable, since information responses 
should be rapidly parsed. See, e.g. Schwartz, 199 3, Learning 
15 Perl , O'Reilly & Associates, Inc.- chapter 7. To solve tnis 
problem, the reg-exp component introduces simple and direct 
syntax for this common idiom: 

stuff "literal-string" 
where "stuff" is a reserved word. Stuff is defined to match 
20 all characters from the current character, up to but not 
including, the first occurrence of the string 
"literal-string," which must be a string literal. This 
construct allows a compact, efficient, backtrack-free 
implementation of this common and important idiom. 
25 A third novel facility allows relevant data fields to be 

extracted from arbitrarily complex regular expressions. 
Action language statements can be embedded in the RHS of a 
nonterminal for execution whenever that nonterminal is 
matched and with the variable bindings current at each 
30 occurrence of a match of that nonterminal. In the case of 
the previous example, the definition for <item> can be 
extended as follows: 

<item> ::= "Data:" (stuff) "\n« { output^; } END. 
Whenever <item> is matched, $0 is set to the string matched 
35 by stuff in the parenthesis, and then the output statement is 
executed. In this case, $0 gets bound to whatever characters 
followed "Data:" and preceded the newline character. 
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Prior systems, for example AWK and Perl, do not have 
such a facility. Although they do set variables, such as $0, 
$1, etc., they do so only after matching the single composite 
regular expression defining <page>. In order words, the 
5 entire page is matched before variables are set. Therefore, 
in AWK and Perl, if there is more than one <item> on the 
<page>, the relevant data fields in all but the last <item> 
are lost . The reg-exp component of the WDL solves this 
problem by allowing specification of actions that are 

10 executed multiple times, once for each nonterminal match, 
with different variable bindings each time. 

Turning now to the description of the reg-exp language, 
it includes certain preferred base-level features: definition 
of nonterminal regular expressions; embedding action language 

15 statements in regular expression rules; operators lor 

expressing alternation and repetition; and literal string 
match. It will be apparent to those of skill in the art that 
a preferred base-level reg-exp language of equivalent 
semantic expressive abilities can be constructed with a 

20 slightly different choice of features, and the reg-exp 
language component of the WDL includes such well-known 
equivalents. In particular, the reg-exp language comprehends 
variable renamings and known grammatical transformations 
applied to the rules below. 

25 In optional embodiments, the preferred base features can 

be augmented with such additional features as: a special 
disjunction ("ll")/ repetition for an arbitrary integer; 
matching to character classes; local string memory; and 
anchoring characters. Such additional features are standard 

30 with the exception of the special disjunction, which causes 
the listed alternatives to be matched simultaneously as the 
string is parsed, the first alternative to match being 
returned. In contrast, the regular disjunction matches each 
alternative in the sequence listed, backtracking until 

35 success is found. The additional features can be added to 
the base- level language in known manners. see, e.g. 
Schwartz, 1993, Learning Perl , O'Reilly & Associates, Inc.; 
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Aho et al., 1986, Coropilers Principles, Techniques, and 
Tools , Addison Wesley Publishing Co,; Hopcroft at el., 
Introduction to Automata Theory, Languages, and Computation . 
Addison Wesley Publishing Co. 
5 In a further alternative embodiment, declarations could 

be added to control backtracking in order to improve 
performance of the regular expression matching . If a certain 
portion of a rule were Known to require no backtracking, that 
portion can be bracketed with declarations instructing the 
10 compiler and interpreter to generate a finite state machine 
without any provision for backtracking* This permits that 
portion of the rule to be matched more efficiently than 
otherwise* 

The syntax of the preferred base-level action language 
X5 is given by the following grammar expressed in a standard 
notation. 



20 



Rule 
Regexp 
Sequence 
Repetition 



25 



Term 



30 



Action 



= <nonterminal> Regexp "END" 

= Sequence ( | Sequence ) * 
= Repetition* 
= Term 
Term 
Term H + M 
Term 

:= string_In_Double_Quotes 
| " stuff" 
| «(" Regexp M ) w 
| <nonterminal> 
| Action 

::= statement* "}•* 



Briefly, a rule specifies a particular <nonterminal> to 
recognize a particular regular expression- A regular 
expression can include: disjunctions (" | M ); sequences; zero 
35 or more repetitions ("+ H ) ; one or more repetitions ("+"); and 
zero or one repetition ("? H ) . Terms can be: literal strings 
enclosed by double quotation marks; the special symbol stuff; 
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parenthesized regular expressions; nonterminals, which must 
be defined another rule; or statements written in the action 
language. 

5 The Wrapper Description Language 

A complete wrapper description can be: 
WrapperDescription Statement Rule* 

Thus the preferred WDL entities include a statement in the 
action language, which is typically a sequence of action 

10 language statements, followed by an optional set of rules in 
the reg-exp language, which define a regular expression for 
matching a response returned from information sources. 

To execute a complete wrapper description, the statement 
is executed as is described subsequently. Typically, the 

15 statement contains calls to the built-in functions fetch () 
and parse(), among others. The parse () built-in function 
attempts to match a response returned by fetch () by invoicing 
a nonterminal defined by the appended rules, each of which 
defines a regular expression. If the regular expression 

2 0 match is successful, all the action language statements 
typically embedded in the regular expression are executed, 
typically some action statements embedded in lower level non- 
terminals are executed multiple times with the operand 
bindings current at each occurrence of a match , and the 

25 parse () function returns the value U 1 M . If the match is 
unsuccessful, none of the embedded statements are executed, 
and the parse () function returns the value "0". 

5.5.2. IMPLEMENTATION OP THE WDL COMPONENTS 

30 The preferred implementation of the WDL is described in 

this section under the following headings: (1) parsing 
regular expression rules, (2) intermediate code generation 
for regular expressions, (3) run-time interpretation »of 
regular expressions, (4) action language code generation and 

35 run-time interpretation. It is understood that this 

preferred implementation of the WDL is exemplary. It is 
known to those of skill in the art that alternatives exist to 
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the implementation described herein. For example, the 
processes described can be implemented differently, e.g., 
with different variables and different orderings of the 
individual steps. Also, the processes may implement 

5 alternative algorithms to achieve the same effects for, e.g., 
string matching. For a further example, the described 
implementation discloses processes of interpreting various 
intermediate codes. Different intermediate codes can be 
used. For the regular expression language, different types 

10 of nodes are possible including, e.g., nodes having a 

variable list of successor states to avoid having to maintain 
a backtracking stack of successor states. For the action 
language, instead of the disclosed address-based intermediate 
code, equivalently a reverse Polish stack-based intermediate 

15 code could be used. Finally, instead of interpretation, it 
is possible to compile directly to a machine language using 
the disclosed syntax-directed methods. 

In the remainder of this section, a parse tree node that 
has label "1 H , contains data "d", and has child nodes M c_l, 

20 ... ,c_n" is denoted by »Kd;c_l, . . . , c_n>" . These parse tree 
nodes can be constructed and referenced in way known in the 
art, e.g., by a pointer to a data area containing node data. 

PARSING REGULAR EXPRESSION RULES 
25 The first step in the preferred implementation of 

compiling a wrapper description is parsing the rules of the 
reg-exp language and the statements of the action language. 
The input to this step are the rules of the reg-exp language 
defining the regular expression to match. The output from 
30 this step is intermediate code in the form of a set of parse 
trees, one parse tree for the top-level rule and additional 
parse trees for each lower- level rule. 

In a preferred embodiment, the process of thid step is 
performed by parsing according to a recursive descent parser 
35 and emitting the parse tree according to a syntax-directed 
translator. Construction of a recursive descent compiler foi 
the previously described syntax of the reg-exp language is 
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well known to those of skill in the art- For example, it is 
clearly disclosed with examples in such textbooks as Aho et 
al., 1986, Compilers Principles, Techniques, and Tools , 
Addison Wesley Publishing Co. at section 2.4 and 4.4. Syntax 
5 directed construction of parse trees is covered with examples 
in section 5.2. This invention is not limited to this 
preferred embodiment. Alternative parsing techniques are 
known in the art, and this invention comprehends embodiments 
using such techniques, such as LL or LR parsing. See 
10 generally Aho at al. at chapter 4. This invention also 
comprehends alternative techniques of intermediate code 
generation known in the art. See generally Aho et al. at 
chapter 5. 

The specification for the syntax-directed analysis 
15 includes rules ror the parse-tree nodes to be created vhen 
each syntax rule is recognized by the parser. For each of 
the previous reg-exp syntax rules the following nodes, 
labeled by the nonterminal of the rule, are created: 

20 1. Rule: Create the node "Rule <nonterminal-name; 
node-for-regexp>. " This is the node for an entire rule 
labeled with its nonterminal name. 

2. Regexp: Create the node "Alternatives 

<node_f or_sequence_l , . . . , node_f orsequence ji> . " This is 
25 the node for a disjunction ("|")' of alternative regular 
expressions, which are tested for a match in the order 
listed, the first successful match being returned. 

3. Sequence: Create the node "Sequence 

<node_f or_repetition_l . . . , node_f or_repetition_n> . » This is 
30 the node for a sequence of regular expression patterns each 
of which must sequentially match. 

4. Repetition: Create the node "Repetition <type; 
node_for_term>. " This is the node for a repetition, where 
type is one of "?" (0 or 1 repetition), "+" (1 or more 

35 repetitions), or "*" (0 or more repetitions). 
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5. String_In_Double Quotes: Create the node "String 
<literal_string>. w This is the node to match a literal 
string . 

6. Stuff: Create the node "Stuff <>" for the reserved word 

5 "Stuff." 

7. (Regexp) : Create the node "Sequence 

<OpenParenthesis<i> , node_f or_Regexp, CloseParenthesis<i>> . " 
This is the node which causes assignment to a sequentially 
numbered variable on a match by "Regexp," The node 
10 "OpenParenthesis<n>" is a node representing the n'th open 
parenthesis that is encountered in sequentially parsing the 
RHS of the current rule. The rule "CloseParenthesis<n>" is a 
node represented the corresponding n'th close parenthesis 
encountered. 

15 8. <nonterminai> : Create the node "Nonterminal 

<nonterminal_ name>, " for an instance of a nonterminal. 
9. Action: Create the node "Action 

<node_f or_action_language_statement>. " This is a node for 
an instance of an action language statement in which the 
20 node_f or_ action_language_statement represents intermediate 
code for the action language statement. 

These rules . which are executed when the parser 
recognizes the corresponding reg-exp language syntax rule, 
25 cause the generation of a parse tree, which is formed from 
the listed node types linked in a true structure. This parse 
tree is input to subsequent steps of this process. 

INTERMEDIATE CODE GENERATION FOR REGULAR EXPRESSIONS 
30 Code generation in the preferred embodiment includes the 

following three steps: 

1. Special preprocessing of stuff nodes; 

2. Eliminating occurrences of nonterminals i*n the RHS 
of the top-level nonterminal; and 

35 3. converting the resulting RHS of the top-level rule 

into a finite state automaton ("FSA" ) . 
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These steps are, first, generally described, and second, 

described in detail. 

First, preprocessing of stuff nodes is necessary to 
later use the preferred postorder traversal method of FSA 

v>« KnUt for a stuff element 
5 generation. As an FSA cannot be built tor a s 

file it is necessary to build an FSA for the combination of 
a stuff node and the literal string node that must follow 
each stuff element, only open and close parentheses can 
intervene between a stuff node and its following literal 
10 string. To do so, the parse tree is traversed and each stuff 
node is merged with the following literal string by replace 
Toth nodes with a new stuf f.And.String node. The new node 
also accounts for any intervening parentheses. 

The second step in code generation substitutes all 

, -r, *-h« shc- ~f th* top- level not.termmax wun 
IS nonterminals in the kH^ — tn. ~r ite rule for 

their respective RHSs to generate a single composite rule to 
Z overall regular expression. During this substitution it 
is necessary to renumber occurrences of variables in action 
language statements. 
20 For example, consider the wrapper 

<page> ::= ("dave") ("bill") <a> ("dan") < 
output ($2); } END 

"oren" (stuff "cody") { output($0); > END 
Here $2 in the output statement in <page> references 
25 rdan" while SO in the output statement in <a> references 
stuff .W). After substitution, if variables in the 
output statement were not renumbered, the HHS of < P age> would 

be expanded as follows: , a4M1 * f »codY") { 

< P age> ("dave") ("bill") "oren" (stuff cody > 

output(SO); } ("dan") { output ( $2 ) ; > 

* n 4-0 f «dave") instead of (stuff 

Here, however, $0 refers to ( dave ; .«.,,„* 
,. cod ;..,, vhll. » refers to (stuff "cody, -stead of 
,,. da ;.., . To maintain correct variable assignment, the 
variable references must be renumbered as follovs: 

<pa,e> === ("dave", ("bill") »°ren» (stuff cody 
output(S2); > ("aan") ( output($3); } 
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The third step in the preferred embodiment of the code 
generation process converts the processed and expanded RHS 
into an FSA by performing a postorder traversal of the parse 
tree representing the RHS of the top-level rule, during which 
5 each node is in turn converted into an FSA. The FSAs have 
several types of states, most importantly: 
1. char_branch: Branch to one of 256 possible successor 
states depending on the current character and advances the 
current read position; 
10 2. Action: Executes a block of action language statements; 

3. Marker: Records the current input string position, plus 
or minus a certain offset, on a mark stack; 

4. success: When reached, parsing has succeeded; 

5. Fail: When reached, the current attempt at parsing has 
15 failed, and the FSA must either backtrack to the prior 

backtracking point or fail if no backtracking points 
available; 

6. Push: Pushes a configuration comprising an FSA state and 
the current input string position, onto the backtracking 

20 stack for later backtracking. 

Code Generation Step 1: P reprocessing 

The preferred pre-processing of stuff nodes begins by 
expanding or flattening any nested sequences, i.e., 
25 Sequenceo elements nested within other sequenceo elements. 
This is done according to the following process: 

FOR nonterminal PERFORM a POSTORDER traversal of its 
parse tree 
30 At each parse tree node v 

IF (v=Sequence<c_l, . • • ,c_n>) THEN 

WHILE (there is an i, K=i<=n, such that 
c_i=Sequence<d_l , . . • , d_m> ) DO 

Replace node v with a new node 
35 Sequence <c_l, . • - ,c_i-l, d_l, ...» 

d m, c_i+l, c_n> 
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Next f each occurrence of a stuff node is replaced by a 
Stuf f_And_String node that also accounts for the following 
literal string and any intervening parentheses. Every 
occurrence of "stuff" in a wrapper must be followed by a 
5 literal string with the semantic result that "stuff" matches 
all characters from the current position up to, but not 
including, the first occurrence of the following literal 
string* Only open and close parentheses can intervene 
between stuff and the literal string* In the preferred 
10 embodiment, this replacement is preferably performed 
according to the following process: 

FOR nonterminal PERFORM a POSTORDER traversal of its 
parse tree. 

15 At each node v of the parse tree 

IF (v=Sequence<c_l , . . . , c_n>) THEN 

WHILE (there is an i, K=i<^n, such that 

c_i=Stuf f<>) DO 

Let j be the smallest j>i such that 
20 c - j=String<s> for some string s. 

IF there is no such j THEN signal an 
error . 

ELSE IF (any element ck (where i<k<j) is 
NOT either an OpenParenthesis or 
25 CloseParenthesis) THEN signal an 

error 

/* cj is node for the string */ 
ELSE replace node v with a new node = 
Sequence <c_l, . . . , c_i-l, 
30 stuf f_And_String<c_j ; c_i+l , . . . , 

c_j-l>, c_j+l, c_n> 

This process merges the stuff node, the literal string node, 
c_ j , and any intervening parentheses nodes, ... c_j-l, into 
35 the single node Stuf f_And_String<the-string; c_i+l, c_j- 
l>. 



- 49 - 



WO 98/12881 



PCT/US97/17132 



Code Generation Step 2; Substitution And Variable Renaming 

Having completed pre-processing, nonterminals on the RHS 
of the top-level rule are preferably substituted and any 
action statement variables renumbered. Substitution and 
5 variable renumbering preferably use a process having two 
functions, EXPAND_REGEXP and EXPAND_ACTION and a global 
variable PAREN_COUNT. EXPAND_REGEXP is called recursively to 
substitute for RHS nonterminals, starting with a call to the 
top-level nonterminal . PARENCOUNT is initialized to 0, and 

10 as substitution proceeds, it is incremented for each " (■' 
encountered, PAREN_COUNT thus has a count of the total 
number of H ("s encountered so far. Then during substitution, 
parentheses encountered are renumbered from their prior 
number to the current value of PAREN_COUNT and variables in 

15 action statements are renumbered in a corresponding fashion. 
For example, if PAREN_COUNT is currently 8, an 
OpenParenthesis<2> node is replaced by an OpenParenthesis<8> 
node, and an output ($2) statement is replaced by an 
output ($8) statement. EXP AND_ACT I ON performs action 

20 statement variable renumbering. These processes are as 
preferably performed according to the following: 

GLOBAL VARIABLE integer PAREN_COUNT; 

FUNCTION EXPAND_REGEXP ( <nt> : nonterminal) : RETURNS a 
25 parse tree with renumbered variables 

/* an array of integers with new variable numbers 
*/ 
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LOCAL VARIABLE new_names ( ] 

LET T = parse tree for the RHS of rule for <nt> 
PERFORM a PREORDER traversal of T. 
At each node v of tree T 

IF (v=OpenParenthesis<i>) THEN 

new_names[i] = PAREN_COUNT; 
REPLACE node v with 

OpenParenthesis<PAREN_COUNT> 
PAREN_COUNT = PAREN_COUNT + l; 
ELSE IF (v=CloseParenthesis<i>) THEN 
REPLACE node v with 

CloseParenthes is<new_names [ i ] > 
ELSE IF (v=Nonterminal<x>) THEN 

REPLACE node v with EXPAND_REGEXP (x) 
15 ELSE IF (v=Action<x>) THEN 

REPLACE node v with EXPAND_ACTION(x, 
new_names ( ] ) 

FUNCTION EXPAND_ACTION ( T : parse tree T for action 

language statement; new_names[] ) : RETURNS a parse 
tree with renumbered variables 

PERFORM a PREORDER traversal of the parse tree T. 

At each node v, 

IF (v denotes variable name $i (i an integer) 

THEN 

REPLACE node v with a new node with $i 
replaced by $(new_names[i] ) . 



20 



25 



rnrlr -rnrr^-" ^ creating A Finite State Automata 

The final step in the preferred embodiment of code 

30 generation creates an FSA representing the substituted and 
processed regular expression. Although algorithms for 
creating such FSAs are known, they have not in the past 
provided facilities, for example, for embedding action 
language statements in such FSAs. Accordingly, the following 

35 description focuses on those new features of the preferred 
embodiment of this process that are directed to supporting 
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action language statements and the other novel features of 
the reg-exp language. 

An FSA output from this process starts its execution in 
an initial state with an input pointer set to the start of 
5 the input string and then repeatedly executes one of six 
basic procedures according to its current state. These 
procedures set a new current state and typically effect one 
or more of the data structures accessed by the FSA. The six 
types of states and their procedures are as follows: 

10 (1) Char_branch<next_state_0 , , next_state_2 5 5> : 

When the current state is a char_branch state, the next state 
is selected according to the ASCII value, i, of the current 
input character as next_state_i , and the input pointer is 
advanced to the next character in the input string. 
1,5 (2) Marker<num, open/ ciosef lag, offset, next_state>: 

When the current state is a marker state, a record is pushed 
on a run-time stack known as the mark stack that contains the 
character position of a parenthesis encountered in the input 
string plus the indicated offset and an indication of whether 
20 this parenthesis is open or close, and the next state is set 
to next_state. An exemplary mark can push a record 
indicating that the fifth open parenthesis occurred at the 

current input position. 

(3) Action<compiled_action_statements, next_szate> : 

25 When the current state is an action state, those action 

language statements represented by compiled_action_statements 
are executed and the next state is set to nextstate. 

(4) Successor When the current state is the success 
state, a string has been successfully matched and this FSA 

30 terminates. 

(5) Push<state, next_state>: When the current state is 
a push state, the current configuration of the FSA is pushed 
onto a run-time stack known as the backtracking stack, and 
the next state is set to next_state. An FSA configuration is 

35 a record containing identification of at least the current 
state together and the current position in the input string. 
Push is used to support the non-deterministic constructs in 
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the regular expression language. An exemplary non- 
deterministic construct is the disjunction "reg-exprl | reg- 
expr2," which is matched by an FSA that, first, pushes a 
backtracking state onto the backtracking stack, then two, 
5 attempts to run the FSA corresponding to reg-exprl; if that 
fails, then it backtracks and (c) attempts to run the finite 
state machine corresponding to reg-expr2. 

(6) Failureo: When the current state is the failure 
state, a backtrack record is popped off the backtracking 
10 stack and the next state and current input pointer are set to 
the contents of the backtrack record. If the backtracking 
stack is empty, the FSA has failed to match the input string 

and terminates. 

creation of the FSA for an input regular expression is 
15 done using a postorder traversal of the previously produced 
parse tree. As the parse tree is traversed, to each node v 
is attached a finite state machine that matches the 
sub-expression represented by the sub-tree rooted at v. This 
attached finite state machine is the value of the variable 
v.machine. The final FSA output is the FSA that is attached 
to the root of the parse tree when the creation process 
completes. 

nurinn the traversal, certain finite state machines are 
created in accordance with methods already known to those of 
lS skill in the art. These machines recognize standard regular 
expression constructs and can be constructed by known 
methods. General references for the creation of such 
standard finite state machines include the following: Aho et 
al., 1974, ™» m.«.ian A ^ &™iv S is of Computer Algorithms , 
30 Addison Wesley Publishing Co., chapter 9; Aho et al., 1986, 
n ^,«™ Prin^^~ T^chni^ps *nd Tools, Addison Wesley 
Publishing Co., chapter 3; Hopcroft et al., 1979, 
yr »-~,..-«™ «■» Automat a Theory, Langu .gps , and Computa tion, 
Addison-wesley Publishing Co., Section 2.5; sedgewick, 1990, 
35 M.nrith ms in C . Addison-Wesley Publishing Co., chapter 16. 
Further the process disclosed adopts certain 
construction methods that can advantageously by changed m 
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alternative embodiment. For example , the machine 
representing a literal string can be constructed according to 
the Boyer-Moore or other string matching algorithm. See, 
e.g., Sedgewick at chapter 19. 
5 The preferred process is: 

PERFORM a POSTORDER traversal of the tree. 
At each node v, DO: 
IF <v«=OpenParenthesis<n>) THEN 
10 SET v. machine « Marker<n, "open", 0, 

Successor 
ELSE IF (v=CloseParenthesis<n>) THEN 

SET v. machine = Marker<n, "close", 0, 
Success<» 

15 ELSE IF 

(v=Action<node_f or_the_action_language_statements>) 
THEN 

SET v. machine = 

Action<compiled__action_statements, 
20 Success<» 

ELSE IF (v=Strin9<literal_string>) THEN 
SET len = length (the-string) 
FOR i = 1 to len DO 

branch_state[i] = Char_branch <s_0, s_l, 
25 . s_255>, where all the s_j's are 

Failureo EXCEPT the s_j where j is 
the ASCII code for the i'th 
character of the-string which is 
Successo 

30 SET v. machine = MAKERS EQUENCE 

( branch_state [ 1 ] , . * * , 
branch_state [ len] ) . 



35 
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ELSE IF 

(v=Stuf f_And_String<the-string,par_l, . . . ,par_k>) 
THEN 

/* m _i can be constructed according to known 
methods disclosed in the previously cited 
references */ 

SET m_l = a finite state machine that scans 
the input string for the first occurrence of 
the-string and stops as soon as it comes to 
the end of this first occurrence. 
FOR i = 1 to k DO 

SET par_i. machine « marker state 

appropriate to parenthesis par_i and 
having an offset of 
-length (the-string) . 
SET v. machine = MAKE_S EQUE1 iC E (m_l, 

par_l. machine, parjc. machine) . 

ELSE IF (v=sequence<element_l element_k>) 

THEN 

FOR i = 1 to k DO 

SET element_i. machine = machine 
" constructed according to this 
process for element_i 
SET v. machine = MAKE_SEQUENCE 
(elemental. machine, • • • < 
e lement_k . mach ine ) 
/* a disjunction, "I" */ 

ELSE IF (v=Alternatives<element_l element. k» 

THEN 

FOR i = 1 to k DO 

SET element_i. machine = machine 

constructed according to this 
process for element_i ' 
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SET v, machine - element_k. machine 
FOR i = k-1 downto 1 DO 

v. machine = Push<v. machine, 
elemental . machine> 
5 /* type is "?", or ••*» */ 

ELSE IF (v=Repetition<type, element>) ) THEN 

SET element* machine = machine constructed 

according to this process for element 
SET v. machine = a finite state machine 
10 constructed according to methods 

disclosed in the previously cited 
references for this type of repetition 
element . machine 

15 This process uses a HAKE_SEQUENCE function that: Duilds a 

composite machine from a series of sub-machines. The 
composite machine executes each of the sub-machines in 
sequence until one sub-machine fails, in which case the 
composite machine also fails , or all the sub-machines 

20 succeed, in which case the composite machine succeeds. In 
other words, the composite machine runs the first sub- 
machine; if that succeeds, it runs the second; if that 
succeeds, it runs the third, and so on. If any sub-machine 
fails, i.e., reaches a Failureo state, then composite 

25 machine also fails. 

FUNCTION MAKE_S EQUENC E (machine m_l, m_2 , . .., m_k) 
: RETURNS machine, 
newjm = m_l; 
30 FOR i = 2 to k DO 

Traverse the states of new_m and replace 
any Successo states with m_i 
RETURN newjm * 

35 This concludes the preferred construction of FSAs 

representing regular expressions from the reg-exp language ol 
this invention. Next, the FSA execution is described. 
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RUN-TIME INTERPRETATION OF REGULAR EXPRESSIONS 

In a preferred embodiment, the FSA representing a 
regular expression is executed by interpreting the structure 
created in the previous code generation steps. The 
5 interpreter preferably accesses several data structures 
including the current state of the FSA, a pointer to the 
current character position of the input string, a 
backtracking stack: and a marker stack. The backtracking 
stack contains records characterizing the state of 

10 interpretation of the FSA and is used in a manner known in 
the art to implement backtracking, in case an attempted 
partial match of the input string fails. Configuration 
records include the current state and the current input 
position. In an alternative embodiment, push states can be 

15 avoided t>y implementing states with a list o£ possible next 
states . 

The preferred implementation of the novel variable 
binding and action statement features of the reg-exp language 
of this invention utilizes an additional stack, the mark 

20 stack. The semantics of these features require that no 
actions be executed until the entire regular expression 
matching process succeeds. Upon success, all actions are 
then executed with the variable bindings that occurred during 
parsing. This means that a single variable, e.g. Si, can 

25 have different bindings in the same action statement which 
can be executed multiple times on match success. 

This semantics is implemented in the preferred 
embodiment using the mark stack in the following manner. 
When the current interpreted state is an Actiono state, the 

30 action language statement is not immediately executed. 
Rather the action statement code is pushed onto the mark 
stack. Similarly, when the current state is a mark state, 
which represents a parenthesis in the regular expression 
definition, the information present in the mark state is 
35 pushed onto the mark stack. This information permits finding 
the position in the input string of the beginning or ending 
of a current variable binding. To prevent execution of 
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action language statements encountered during a failed 
attempted partial match, the mark stack is popped to an 
appropriate level when the FSA backtracks- Thus, the machine 
configuration and the configuration record on the 
5 backtracking stack further includes the current position in 
the mark stack. 

Action language statements are executed upon final match 
success by scanning the mark stack from bottom to top. In 
this scan, variable bindings are set as indicated by mark 
10 state records and actions are executed as indicated by action 
state records. In case of match failure, no action language 
statements are executed* 

In more detail, an embodiment of the preferred 
implementation functions as follows: 

15 

GLOBAL VARIABLE current_state = initialized to the 

initial machine state 
GLOBAL VARIABLE inputjpos = initialized to the address 

of the beginning of the input string 
20 GLOBAL VARIABLE BT — STACK = initialized to an empty stack 

GLOBAL VARIABLE MARK — STACK » initialized to an empty 

stack 

REPEAT UNTIL one of the following clauses exits 
CASE (current_state) IS 

25 Char_branch<state_0 , , state_255>: 

current_state = state_ (ASCII value of the 

current input character) 
input_pos = input _pos + 1 

30 



- 58 - 



WO 98/12881 



PCT7US97/17132 



Marker<num,open/close_f lag, offset, nextstate> : 
push ( "mark" , num , open/ close_f lag , 

input_pos+offset) onto KARK_STACK 
current_state = nextstate 
5 Action<compiled_action_code , nextstate> : 

push ("action", compiled_action__code) onto 

MARK_STACK 
current_state = nextstate 
Push<state , nextstate> : 
10 push (state, input_pos, height (MARK — STACK) ) 

onto BT_STACK 
current^state = nextstate 
/* exit the REPEAT statement on final success */ 
Successor GOTO done 
15 Failureo: 

/* match finally fails if backtracking is 
attempted and the backtrack stack is empty */ 
IF BT_STACK is empty THEN 
RETURN "fail" 

20 ELSE 

pop BT_STACK record into variables st, 
ip, msh * 

current_state = st 
input_pos = ip 

25 pop MARK_STACK down to height msh 

END CASE 

/* when parsing succeeds, scan HARK_STACK and execute 
actions with variable bindings indicated by the marks */ 
done: 

30 LOCAL VARIABLE open_marks [ ] , close_marks [ ] 

FOR element = bottom of MARK_STACK up to top of 
MARK_STACK DO: 

IF ( element=( "mark", num , " open" , pos ) ) THEN , 
SET open_marks[num] = pos 
35 if (element= ("mark", num, "close", pos) ) THEN 

SET close_marks[num] - pos 
IF (element=("action",compiled_action_code) ) THEN 
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SET values of $0, $1, Sn to the positions 

in the input string indicated by 
open_marks ( n ] and close_raarks[n] 
RUN compiled_action_statement using these 
5 bindings of $0, $1, . .-f $n 

END FOR 

/* match succeeded and all action executed */ 
RETURN "succeed" 

10 For example, in the preceding FOR loop, the binding of $2 is 
set to the sub-string of the input string beginning at 
open_marks [ 2 ] and ending at close_marks( 2 ) . 

.pttoM T.ANCTIAGE C OnF GENERAT E AND RUN-TTME INTERPRETATION 

15 in a preferred embodiment, action language statements 

are compiled into a variable length bytecode-type of 
intermediate language that is executed at run-time by a 
bytecode interpreter. The previously described action 
language syntax is parsed by any appropriate known parsing 
20 technique, for example by LL or LR parsing. Adequate parsing 
techniques are presented with examples in Aho et al., 1986, 
r ollers PrjnH.l... Tech ^T'^- ™* Tools, Addison Wesley 
Publishing Co., chapter 4. During parsing all variables 
occurring in the wrappers are assigned unique numbers. A 
25 preferred such assignment assigns sequential positive 

integers to named variables, e.g., $x, $abc, in the order 
they are encountered, and assigns sequential negative 
integers to the numeric variables denoting matched 

«s-i o o»t the numeric variable 
sub-strings, e.g., $0, $1, $2, etc., tne munci. 

30 $i being assigned the number -(i+1). 

intermediate code is generated by known syntax-directed 
translation techniques. Adequate syntax directed translation 
techniques are presented with examples in Aho et al. at 
chapter 8. The preferred intermediate code is presented 
35 below. in this presentation, intermediate language 

instructions have the format of ah instruction code, with 
optional modifying information, followed by zero of more 
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arguments. Variables are denoted by <var> where var is the 
2-byte integer coding of the variable. Relative branch 
offsets are encoded as 4-byte integers. All branches are 
relative to the current instruction address, i.e., the 
5 address of the branch instruction itself, so the bytecode is 
relocatable* 

1. Function Calls: 

Encoding: 0 funcode numargs <var_0> <var_l> ... 

10 <var_numargs> 

Meaning: varO is assigned the result of the function 
identified by "funcode" on the arguments var_l, 
var_numargs of number "numargs." Each built-in 
function, e.g. agrc, argv, etc., and each operator, 

15 e.g., = tc " or tne ian< ? uage nas ltS OWn 

unique integer "funcode". 
2. Branch Instructions: 

Encoding: l <offset> <var0> 

Meaning: if varO is true, branch by amount offset 

relative to the branch instruction 
Encoding: 2 <offset> <var0> 

Meaning: if varO is false, branch by amount offset 

relative to the branch instruction 
Encoding: 3 <offset> 

Meaning: always branch by amount offset relative to the 
branch instruction 
Load Constant Instructions: 

Encoding: 4 <var0> <null-terminated-string> 
Meaning: varO is assigned the null-terminated-stnng 
constant 

Encoding: 5 <var0> < float ing-point-constant> 
Meaning: varO is assigned the floating-point-constant 
4. Move instructions: 

Encoding: 6 <var0> <varl> 
35 Meaning: varO is copied from varl 
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5. Output Instructions: 

Encoding: 7 <varO> 

Meaning: outputs one item varO to the current tuple 
Encoding : 8 

5 Meaning: terminates the current tuple, causes it to be 

returned to the netbot 

6. Parsing Instructions: 

Encoding: 9 <varO> <varl> <address of the finite state 
machine for <nt» 
10 Meaning: varO is assigned the result of the parse of 

varl according to the regular expression denoted by 

<nt> 

7. Exit instructions: 

Encoding : 10 

15 Meaning: exit* the current action language block 

in alternative embodiments, different intermediate codes 
or even no intermediate code can be used. For example, 
instead of previously disclosed address-based intermediate 
20 code, equivalents a reverse Polish stack-based intermediate 
code could be used. Optionally, no intermediate code is 
used, and action language statements are compiled directly to 
a machine language. Both these alternative can be 
implemented by using the disclosed syntax-directed methods. 

25 See, e.g., Aho et al. . rrerf 

Execution of wrapper actions is performed in a preferred 
embodiment in several steps. First, memory is allocated to 
store all variables and variable values. The number of 
variables is known after the action statement have been 
30 parsed. All variables are initialized to an ™ asS ^ n f 

value. Next, a pointer CURRENT JCNSTRUCTION is initialized to 
point to the first instruction of the bytecode to be 
executed. Then, the interpreter enters a loop wnl£ * 
retrieves the bytecode pointed to by CURRENT_INSTRUCTI0N , 
35 performs the coded simple action according to the bytecode 
language previously described, and updates 

CURRENT_INSTRUCTION to point either to the next instruction, 
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or for taken branch instructions, is modified by the offset 
value. The interpreter finishes execution upon encountering 
an exit instruction. 

The parse built-in function is executed at run-time by 

5 calling the FSA interpreter to interpret the compiled regular 
expression code specifying the finite state machine. 
Output (a_l,...,a_n) statements are translated and then 
executed by executing bytecode instruction code 7 for each 
variable in the output list followed by a bytecode 

10 instruction code 8 that terminates the current tuple, i.e., 
marks the end of the output statement. 

6 . BPKCIFIC EM«r»DTMEKTS, CITATION OF REFERENCES 

The present invention is not to be limited in scope by 
is the specific embodiments described herein. Indeed, various 
modifications ol the invention in addition to those described 
herein will become apparent to those skilled in the art from 
the foregoing description and accompanying figures. Such 
modifications are intended to fall within the scope of the 

2 0 appended claims. 

various publications are cited herein, the disclosures 
of which are incorporated by reference in their entireties. 
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WHAT IS CLAIMED IS: 



1. 



10 



la 



20 



2. 



25 3 



30 



35 



A method for assisting a user to query for information 
available from information sources attached to a 
network, said method comprising: 

a . selecting the one or more information sources most 
relevant to a user query; 

b formatting said user query for each said relevant 
information source according to a description of each 
said relevant information source written in a wrapper 
description language; 

c. transmitting said formatted query to each of said 
relevant information sources; 

d. extracting data fields relevant to said user query 
fi-om responses returned from said relevant information 
sources, said extracting according to said description 
of said relevant information source returns, each said 

~« —ant data fields to said user. 

The method of claim 1 further comprising, after said 
extracting step, a step of estimating the relevance of 
said responses to said user query. 

Th e method of claim 1 where said selecting step further 
comprises the steps of: 

a . determining one or more concepts to which saxd user 

query relates; and sa id 
7 selecting an information source as relevant if said 

information source relates to said concepts. 

The method of claim 1 wherein said wrapper ^Jf*™ 
' language comprises facilities for the specification of 
„ Lire regular expression in terms of component 

emch that, upon recognition of one 
regular expressions sucn tn«v, * 

h f said en tire regular 

component regular expression bound 

^ ac tions can be executed with variables boun 
expression, actions can 
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as of the time of recognition of said component regular 
expression, said actions actually being executed only if 
said entire regular expression is also recognized. 

5 5. A system for assisting a user to query for information 
available from information sources attached to a 
network., said system comprising: 

a. a first processor means for selecting the one or 
more information sources most relevant to a user query; 

10 b . a second processor means for formatting said user 

query for each said relevant information source 
according to a description of each said relevant 
information source written in a wrapper description 
. language, and for transmitting said formatted query to 

15 each of said relevant information sources; 

c. a t.iird processor means for extracting data fields 
relevant to said user query from responses returned from 
said relevant information sources, said extracting 
according to said description of said relevant 

20 information source returning each said response; and 

d. a fourth processor means for presenting said 
relevant data fields to said user. 

6. The system of 5 wherein said first, said second, said 
2 5 third, and said fourth processor means reside in one 

network attached computer, said computer comprising a 
CPU and memory. 

7. The system of 5 wherein said first, said second, said 
30 third, and said fourth processor means reside in 

separate network attached computers, each said computer 
comprising a CPU and memory 

» 

8. The system of 5 wherein said fourth processor means 

35 presents said relevant data fields by executing a World 

Wide Web browser. 



- 65 - 



WO 98/12881 



PCT/US97/17132 



9. The system of 5 further comprising a first network 

attached memory means for storing the descriptions of 
said information sources and for providing said 
descriptions upon request of said second or third 
5 processor means . 

10. A wrapper description language method comprising 

recognizing a string by scanning said string according 
to a regular expression written in a regular expression 
10 language, said regular expression language capable of 

specifying an entire regular expression in terms of 
component regular expressions such that, upon 
recognition of one component regular expression of said 
entire regular expression, actions can be executed with 
1S variables bound as of the time ot recognition of said 

component regular expression, said actions actually 
being executed only if said entire regular expression is 
also recognized. 

20 11. The wrapper description language method of claim 10 
further comprising the step of executing one or more 
statements written in an action language, said action 
language capable of specifying the actions of operators 
on variables, the actions of functions on variables and 
25 of the sequence of statements, the statements of said 

action language being further capable of being embedded 
in said regular expression language for specifying said 
actions occurring on the recognition of said component 
regular expressions. 

12. A computer readable medium containing instructions for 
causing one or more computers to perform the method of 
claim 1. 

A computer readable medium containing instructions for 
causing one or more computers to perform the method of 
claim 10. 



30 



35 13. 
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